{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# COMPSCI 389: Introduction to Machine Learning\n", "# Topic 5.4 Evaluation, One Last Time\n", "\n", "**Note:** This notebook is described in the slides, `5.4 Evaluation Part 4.pdf`. All of the important content within this notebook is in those slides, so you are not responsible for this notebook. However, you may reference this notebook to run the examples from the slides.\n", "\n", "The code below should be review. It:\n", "1. Imports the libraries we use\n", "2. Defines the evaluation metrics we use\n", "3. Defines the KNearestNeighbors model\n", "4. Defines the WeightedKNearestNeighbors model" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.neighbors import KDTree\n", "from sklearn.base import BaseEstimator\n", "from sklearn.model_selection import train_test_split\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "def mean_squared_error(predictions, labels):\n", "    return np.mean((predictions - labels) ** 2)\n", "\n", "def root_mean_squared_error(predictions, labels):\n", "    return np.sqrt(mean_squared_error(predictions, labels))\n", "\n", "def mean_absolute_error(predictions, labels):\n", "    return np.mean(np.abs(predictions - labels))\n", "\n", "def r_squared(predictions, labels):\n", "    ss_res = np.sum((labels - predictions) ** 2) # ss_res is the \"Sum of Squares of Residuals\"\n", "    ss_tot = np.sum((labels - np.mean(labels)) ** 2) # ss_tot is the \"Total Sum of Squares\"\n", "    return 1 - (ss_res / ss_tot)\n", "\n", "class KNearestNeighbors(BaseEstimator):\n", "    # Add a constructor that stores the value of k (a hyperparameter)\n", "    def __init__(self, k=3):\n", "        self.k = k\n", "\n", "    def fit(self, X, y):\n", "        # Convert X and y to NumPy arrays if they are DataFrames\n", "        if isinstance(X, pd.DataFrame):\n", "            X = X.values\n", "        if isinstance(y, pd.Series):\n", "            y = y.values\n", "\n", "        # 
Store the training data and labels\n", "        self.X_data = X\n", "        self.y_data = y\n", "\n", "        # Create a KDTree for efficient nearest neighbor search\n", "        self.tree = KDTree(X)\n", "\n", "        return self\n", "\n", "    def predict(self, X):\n", "        # Convert X to a NumPy array if it's a DataFrame\n", "        if isinstance(X, pd.DataFrame):\n", "            X = X.values\n", "\n", "        # Query the tree for the k nearest neighbors for all points in X\n", "        dist, ind = self.tree.query(X, k=self.k)\n", "\n", "        # Return the average label for the nearest neighbors of each query\n", "        return np.mean(self.y_data[ind], axis=1)\n", "\n", "class WeightedKNearestNeighbors(BaseEstimator):\n", "    # Add a constructor that stores the value of k and sigma (hyperparameters)\n", "    def __init__(self, k=3, sigma=1.0):\n", "        self.k = k\n", "        self.sigma = sigma\n", "\n", "    def fit(self, X, y):\n", "        # Convert X and y to NumPy arrays if they are DataFrames\n", "        if isinstance(X, pd.DataFrame):\n", "            X = X.values\n", "        if isinstance(y, pd.Series):\n", "            y = y.values\n", "\n", "        # Store the training data and labels\n", "        self.X_data = X\n", "        self.y_data = y\n", "\n", "        # Create a KDTree for efficient nearest neighbor search\n", "        self.tree = KDTree(X)\n", "\n", "        return self\n", "\n", "    def gaussian_kernel(self, distance):\n", "        # Gaussian kernel function\n", "        return np.exp(- (distance ** 2) / (2 * self.sigma ** 2))\n", "\n", "    def predict(self, X):\n", "        # Convert X to a NumPy array if it's a DataFrame\n", "        if isinstance(X, pd.DataFrame):\n", "            X = X.values\n", "\n", "        # We will iteratively load predictions, so it starts empty\n", "        predictions = []\n", "\n", "        # Loop over rows in the query\n", "        for x in X:\n", "            # Query the tree for the k nearest neighbors\n", "            dist, ind = self.tree.query([x], k=self.k)\n", "\n", "            # Calculate weights using the Gaussian kernel\n", "            weights = self.gaussian_kernel(dist[0])\n", "\n", "            # Check if weights sum to zero. 
This happens when all points are very far away, giving weights that round to zero and causing division by zero later. In this case, revert to unweighted (all weights are one).\n", "            if np.sum(weights) == 0:\n", "                # If weights sum to zero, assign equal weight to all neighbors\n", "                weights = np.ones_like(weights)\n", "\n", "            # Weighted average of the labels of the k nearest neighbors\n", "            weighted_avg_label = np.average(self.y_data[ind[0]], weights=weights)\n", "            predictions.append(weighted_avg_label)\n", "\n", "        # Return the array of predictions we have created\n", "        return np.array(predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Algorithm Evaluation\n", "\n", "Notice that the discussion so far has focused on using a test set to evaluate a single model that was trained from data. This captures our uncertainty about the performance of the model that was learned. If we run the algorithm many times on different training sets, we could obtain models of different quality. The true MSE of each model could differ! Our analysis so far did not capture this.\n", "\n", "That analysis is useful for testing how much you can trust a specific model, but less useful for comparing algorithms in general. To compare algorithms, we can do the following:\n", "- Specify a number of trials, `num_trials`.\n", "- For each trial $i$ in $1,\\dotsc,\\text{num\\_trials}$ do:\n", "    - Sample a data set (ideally independent of the data sets for other trials).\n", "    - Split the data set into training and testing sets.\n", "    - Use the ML algorithm to train a model on the training set.\n", "    - Use the trained model to make predictions for the testing set.\n", "    - Compute the sample performance metric (e.g., sample MSE) for the test set. 
Call this $Z_i$.\n", "- Compute and report the average of $Z_1,\\dotsc,Z_{\\text{num\\_trials}}$ (e.g., the average sample MSE).\n", "- Compute and report the standard error of $Z_1,\\dotsc,Z_{\\text{num\\_trials}}$.\n", "\n", "This standard error incorporates uncertainty due to both the sampling error in each sample MSE and the varying MSE of the learned models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-Validation\n", "\n", "Notice that we can't easily do this using the GPA data set, since we can't generate `num_trials` independent data sets (unless we consider data sets much smaller than our actual data set).\n", "\n", "Cross-validation is a technique that resolves this by repeatedly splitting the same data set into different training and testing sets. The most common version is $k$-fold cross-validation, which operates as follows.\n", "\n", "- **Input:** Dataset `D`, Number of folds `k`, Machine Learning Algorithm `ML_Algo`\n", "- **Output:** Cross-validated performance estimate\n", "\n", "Procedure:\n", "\n", "1. Split `D` into `k` (approximately) equal-sized subsets (folds) `F1, F2, ..., Fk`.\n", "2. For `i` from 1 to `k`:\n", "    - Set aside fold `Fi` as the validation set, and combine the remaining `k-1` folds to form a training set.\n", "    - Train the model `M` using `ML_Algo` on the `k-1` training folds.\n", "    - Evaluate the performance of model `M` on the validation fold `Fi`. Store the performance metric `P_i`.\n", "3. Calculate the average of the performance metrics: `Average_Performance = mean(P_1, P_2, ..., P_k)`.\n", "4. 
Optionally, calculate other statistics (like standard deviation or standard error) of the performance metrics across the folds.\n", "\n", "One notable variant of k-fold cross-validation is **leave-one-out (LOO) cross-validation**, which sets `k` equal to the size of the data set so that each fold is a single point.\n", "\n", "Scikit-Learn has a useful function [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html), which simplifies creating folds.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average MSE: 0.571\n", "MSE Standard Error: ±0.004\n" ] } ], "source": [ "import pandas as pd\n", "from sklearn.model_selection import KFold, cross_val_score\n", "from sklearn.metrics import mean_squared_error\n", "import numpy as np\n", "\n", "# Load the data set\n", "df = pd.read_csv(\"data/GPA.csv\", delimiter=',')\n", "\n", "# We already loaded X and y, but do it again as a reminder\n", "X = df.iloc[:, :-1]\n", "y = df.iloc[:, -1]\n", "\n", "# Define the model\n", "model = WeightedKNearestNeighbors(k=300, sigma=100)\n", "\n", "# Choose number of folds for k-fold Cross-Validation\n", "k = 20\n", "kf = KFold(n_splits=k, shuffle=True, random_state=1)\n", "\n", "# Function to compute MSE for each fold\n", "def mse_for_fold(train_index, test_index, model, X, y):\n", " X_train, X_test = X.iloc[train_index], X.iloc[test_index]\n", " y_train, y_test = y.iloc[train_index], y.iloc[test_index]\n", " model.fit(X_train, y_train)\n", " predictions = model.predict(X_test)\n", " return mean_squared_error(y_test, predictions)\n", "\n", "# Compute MSE for each fold\n", "mse_scores = [mse_for_fold(train_index, test_index, model, X, y) for train_index, test_index in kf.split(X)]\n", "\n", "# Calculate the average MSE and standard error\n", "average_mse = np.mean(mse_scores)\n", "mse_standard_error = np.std(mse_scores, ddof=1) / np.sqrt(k)\n", "\n", 
"print(f\"Average MSE: {average_mse:.3f}\")\n", "print(f\"MSE Standard Error: ±{mse_standard_error:.3f}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below helps with visualizing how kfold is used." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KFold(n_splits=20, random_state=1, shuffle=True)" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 10 44 45 ... 43267 43290 43296]\n", "MSE Score for this fold: 0.5807234989808185\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 40 93 134 ... 43246 43256 43261]\n", "MSE Score for this fold: 0.5630048290694765\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 3 23 34 ... 43262 43286 43288]\n", "MSE Score for this fold: 0.5553467010840363\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 8 19 25 ... 43277 43293 43299]\n", "MSE Score for this fold: 0.6129428000450592\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 11 33 58 ... 43255 43282 43292]\n", "MSE Score for this fold: 0.5933726084007112\n", "TRAIN: [ 0 1 3 ... 43300 43301 43302] TEST: [ 2 22 24 ... 43271 43284 43298]\n", "MSE Score for this fold: 0.5644141827789226\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 21 36 55 ... 43241 43249 43275]\n", "MSE Score for this fold: 0.573666751853279\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 26 46 62 ... 43223 43231 43265]\n", "MSE Score for this fold: 0.5764896819599702\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 29 35 43 ... 43229 43234 43242]\n", "MSE Score for this fold: 0.5443248559898528\n", "TRAIN: [ 1 2 3 ... 43300 43301 43302] TEST: [ 0 13 31 ... 43274 43278 43281]\n", "MSE Score for this fold: 0.5647024320496056\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 4 16 30 ... 43235 43266 43285]\n", "MSE Score for this fold: 0.6207282916497956\n", "TRAIN: [ 0 1 2 ... 
43300 43301 43302] TEST: [ 6 7 17 ... 43243 43244 43294]\n", "MSE Score for this fold: 0.5598859483455714\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 5 12 49 ... 43272 43273 43297]\n", "MSE Score for this fold: 0.574600925633249\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 50 60 70 ... 43233 43236 43268]\n", "MSE Score for this fold: 0.5668328172717993\n", "TRAIN: [ 0 2 3 ... 43299 43301 43302] TEST: [ 1 90 104 ... 43219 43269 43300]\n", "MSE Score for this fold: 0.5639934776750515\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 80 98 117 ... 43260 43276 43280]\n", "MSE Score for this fold: 0.5617020441188176\n", "TRAIN: [ 0 1 2 ... 43299 43300 43301] TEST: [ 39 48 86 ... 43201 43216 43302]\n", "MSE Score for this fold: 0.5813059809877908\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 9 51 59 ... 43257 43270 43283]\n", "MSE Score for this fold: 0.5601415521711817\n", "TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 14 41 47 ... 43264 43279 43295]\n", "MSE Score for this fold: 0.5377225184946909\n", "TRAIN: [ 0 1 2 ... 43299 43300 43302] TEST: [ 15 18 20 ... 43289 43291 43301]\n", "MSE Score for this fold: 0.5625223799224043\n" ] } ], "source": [ "display(kf)\n", "for train_index, test_index in kf.split(X):\n", " print(\"TRAIN:\", train_index, \"TEST:\", test_index)\n", " mse_score = mse_for_fold(train_index, test_index, model, X, y)\n", " print(\"MSE Score for this fold:\", mse_score)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }